K.  LI,  S.  OH,  A.G.A.  PERERA,  Y.  FU:  VIDEOGRAPHY  ANALYSIS 




A  Videography  Analysis  Framework  for 
Video  Retrieval  and  Summarization 


Kang Li*1
kangli@buffalo.edu 
Sangmin  Oh*2 
sangmin.oh@kitware.com 
A.  G.  Amitha  Perera2 
amitha.perera@kitware.com 

Yun  Fu3 

raymondyunfu@gmail.com 


1  Department  of  CSE 
State  University  of  New  York 
Buffalo,  NY,  USA 

2Kitware,  Inc. 

Clifton  Park,  NY,  USA 

3  Department  of  ECE  and  College  of  CIS 
Northeastern  University 
Boston,  MA,  USA 


Abstract 


In this work, we focus on developing features and approaches to represent and analyze
videography styles in unconstrained videos. By unconstrained videos, we mean typical
consumer videos with significant content complexity and diverse editing artifacts, mostly
with long duration. Our approach constructs a videography dictionary, which is used to
represent each video clip as a series of varying videography words. In addition to
conventional features such as camera motion and foreground object motion, two novel
features, motion correlation and scale information, are introduced to characterize
videography. Then, we show that unique videography signatures from different events can
be automatically identified using statistical analysis methods. For practical applications,
we explore the use of videography analysis for content-based video retrieval and video
summarization. We compare our approaches with other methods on a large unconstrained
video dataset, and demonstrate that our approach benefits video analysis.

1  Introduction 

Automatic understanding of visual content in unconstrained Internet videos, such as those
found on consumer video sharing sites (e.g., YouTube and Metacafe), is an interesting
but very challenging task. These videos are particularly challenging because they contain
very diverse content; they are captured under a variety of camera motion conditions (panning,
zooming, translating); they are of highly variable length (from minutes to hours); and they
are often heavily edited (e.g., shot stitching and added captions). As such, unconstrained
videos are qualitatively very different from, and even more challenging than, widely-used
video datasets such as the Hollywood dataset [□] or the YouTube Sports dataset [□], in which
video clips contain a fairly coherent single action occurring within a short duration. For
example, some wedding videos from video sharing websites are more than an hour long and are
produced by stitching shots recorded separately across the entire wedding event. Each shot
contains fairly different content, such as a panning camera capturing a party room filled
with dancing guests, a series of stitched shots of each guest individually congratulating the
wedding couple, or a shot that zooms in on the bride and groom. On the other hand, other
wedding videos may be only minutes long, and only contain shots of the key events of the
ceremony.

Figure 1: Framework for videography analysis and applications for unconstrained videos.
Panels include (b) Videography Dictionary, (c) Quantization & Learning, and (e) Adaptive
Summarization. See text for details.

In this work, we present an approach for unsupervised videography analysis for this type
of unconstrained video. Intuitively, each videography can be understood as a camera
director's direction on a movie script, e.g., “capture the running actress by panning the
camera, to have her face appear at 20 percent size of the video”. The idea is that different
classes of video content will have different videography styles (the videography style of a
wedding video should be different from that of a sports video), and so the videography style
should provide a valuable signal for automated content analysis. In this paper, we
demonstrate the value of videography analysis for video retrieval by event class and for
video summarization.

In our approach, we assume that there are diverse videography styles in unconstrained
videos, which are discovered as a videography dictionary via unsupervised clustering on
the proposed features. Then, a video clip can be represented as a series of segments with
varying videography words. For the underlying videography features, we extend conventional
features such as camera motion and foreground (FG) object motion [0, O, □, E9, 123] by
incorporating two novel features: motion correlation and scale information (see Sec. 3). To
the best of our knowledge, our work is the first to address the explicit learning of a
videography dictionary based on such a rich set of features beyond simple camera motions.

An overview of our proposed approach is illustrated in Fig. 1. We first (a) extract
the videography features by decomposing the video into segments based on camera
motion-derived “shot boundaries”, separating foreground/background motion within each
segment, and computing a series of features (as illustrated in Fig. 2 and described in
Sec. 3). Then we (b) cluster these features to develop a videography dictionary, and
(c) quantize the segments into videography style words and learn the relationship between
the style words and events. This is used for (d) video retrieval, and (e) to help
content-adaptive video summarization.

For retrieval, we compare our approach with alternative methods on the large TRECVID
multimedia event detection (MED) '11 video dataset [ID] across 15 diverse query
collections, and show that the videography style does indeed add complementary information
(Sec. 5). In addition, our adaptive summarization approach differs from the existing
body of work relying on fixed rules (e.g., [123]) in that our system optimizes the
summarization process to highlight the unique content of the given test videos (Sec. 6).

2  Related  Work 

The idea of representing videos as a series of segments based on motion and/or appearance
characteristics has been explored to some extent, either as part of integrated systems
[HE, E2, 123] or on its own [O, HE]. Most systems, including this work, incorporate two
main low-level processing steps: (a) shot boundary detection [0, HE, 123], which finds the
boundaries between stitched shots, and (b) camera motion estimation within shots
[□, O, H3I, 033, E3, E3], which further decomposes shots into finer sub-shot units based
on evolving camera motion types.

It  is  worth  noting  that  we  incorporate  existing  state-of-the-art  methods  as  part  of  our 
feature  extraction  module,  and  focus  on  (a)  developing  novel  techniques  to  enable  high- 
level  videography  analysis  and  (b)  its  application  for  retrieval  and  summarization  based  on 
noisy  videography  quantization  as  intermediate  representations.  Shot  boundary  detection 
is  believed  to  be  largely  solved  [HE];  we  adopt  [123].  For  background  (BG)  camera  motion 
estimation,  we  extend  [H3,  El]  to  estimate  three  P/T/Z  camera  motion  parameters  from  KLT 
tracks  while  simultaneously  separating  the  tracks  into  FG/BG  groups.  We  found  that  other 
approaches  for  FG/BG  separation  such  as  [□,  O]  are  unsatisfactory  for  unconstrained  videos, 
possibly  due  to  the  complex  geometric  scene  structure  in  our  data. 

In  terms  of  videography  modeling,  the  methods  closest  to  our  work  are  [E3,  E3].  In 
[E3],  a  system  capable  of  both  summarization  and  retrieval  was  presented.  The  system  is 
mostly  based  on  hand-tuned  distance  metrics  and  rules  to  classify  shots  and  videos  into 
semantic  categories,  based  on  multiple  features  with  heavy  emphasis  on  appearance  (e.g., 
color  and  texture),  and  a  few  others  such  as  simple  camera  motion  primitives  (S/P/T/Z).  In 
our retrieval experiments (Sec. 5), we compare our new features with these four simpler
camera motion primitives. It is worth noting that our work presents results primarily
based on motion information without relying on appearance matching, and hence provides a
clearer understanding of the promise of motion-based videography modeling alone for
high-level tasks. Additionally, since our approach is learning-based, the heavy burden of
tuning system parameters is alleviated. In [E3], the authors present seven self-defined
videography styles common in commercial movies, which are classified per shot based on
features such as motion, appearance, and FG/BG separation; the videography quantization is
based on supervised learning, and its use for summarization or retrieval is not studied. In
contrast, our approach is unsupervised and does not require manually labeled training data
for sub-shot classification, and hence can scale up to unconstrained videos with more
complex videography styles beyond commercial movies.

For video retrieval based on videography, other than the above-mentioned related work
in [E3], [O] used simple average profiles of FG/BG motion magnitude as features. In [HE],
the correlation between different categories of sports videos and camera motion types (e.g.,
S/P/T/Z) and their transitions were studied, but not applied to retrieval.

Figure 2: Videography feature extraction. (Left top) Camera motion estimation with FG/BG
separation. (Left middle and bottom) Original FG motion (green) is corrected (yellow).
(Right) Distribution of extracted videography features, and a clustering-based quantization
(e.g., BG motion quantization).

Video summarization, which has been well studied in the multimedia community [0], is
commonly formulated as a key frame extraction problem, where change detection is typically
applied to appearance features such as color [IZ2]. Approaches that also incorporate
overall camera motion include [□, E3]. However, both works adopt fixed rules for all videos.

3  Videography  Features 

For every input video, our approach applies two main processing steps to extract videography
features, as illustrated in Fig. 1(a). First, a two-level motion analysis is conducted to
decompose long clips into sequences of segments with coherent motion types (S/P/T/Z).
Second, multiple features related to motion and scale patterns are measured from every
segment and used to characterize videography. For both steps, we use densely computed
KLT tracks [O] over the entire clip as the main basis for the derived features.

For the two-level decomposition, we adopt existing state-of-the-art methods, as mentioned
in Sec. 2. In the first phase, we use a shot boundary detection (SBD) algorithm which
relies on the birth and death ratio of KLT tracks [IZ3]. In detail, we developed two SBD
modules, one for each of two boundary styles: Cut (simple abrupt transition) and
Fade-Out-In (common gradual transition), which together account for the majority of
boundaries in videos. On labeled test data of 153 shot boundaries, the precision and recall
are 0.95 and 0.98 for Cut, and 0.63 and 0.75 for Fade-Out-In; i.e., detection is reliable
for cuts and noticeably weaker for gradual transitions.
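As a rough illustration of the birth/death-ratio idea for Cut boundaries, the following
Python sketch flags a frame when most tracks alive in the previous frame end there and most
tracks in the current frame are newly started. The track representation (start and end
frame per track), the two ratio definitions, and the threshold of 0.8 are assumptions made
for this example; they are not the settings of [IZ3] or of this work. A small
precision/recall helper with exact frame matching is included for completeness.

```python
def birth_death_ratios(tracks, num_frames):
    """tracks: list of (start_frame, end_frame) inclusive lifetimes (assumed format)."""
    ratios = []
    for t in range(1, num_frames):
        alive_prev = [tr for tr in tracks if tr[0] <= t - 1 <= tr[1]]
        alive_now = [tr for tr in tracks if tr[0] <= t <= tr[1]]
        died = sum(1 for tr in alive_prev if tr[1] == t - 1)  # ended just before t
        born = sum(1 for tr in alive_now if tr[0] == t)       # started exactly at t
        death = died / len(alive_prev) if alive_prev else 0.0
        birth = born / len(alive_now) if alive_now else 0.0
        ratios.append((t, birth, death))
    return ratios


def detect_cuts(tracks, num_frames, threshold=0.8):
    """Flag frame t as a cut when both ratios reach the (hypothetical) threshold."""
    return [t for t, birth, death in birth_death_ratios(tracks, num_frames)
            if birth >= threshold and death >= threshold]


def precision_recall(detected, truth):
    """Precision and recall with exact frame matching (a simplifying assumption)."""
    true_pos = len(set(detected) & set(truth))
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall
```

A gradual Fade-Out-In transition would not trigger this test at a single frame, which is
consistent with the use of a separate module for that boundary style.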

Then, the second phase decomposes each shot further into sub-segments based on four
camera motion types (S/P/T/Z). For unconstrained videos, camera motion estimation is
challenging due to the complex interplay between the (apparent) motion of background (BG)
and foreground (FG) objects, which need to be separated to yield accurate results. We adopt
[O, El] because of its proven performance on unconstrained videos and its advantage of
solving FG/BG separation simultaneously. As a result, KLT tracks are grouped into BG or
FG, where the BG group accounts for tracks mostly induced by camera motion and the FG group
consists of outliers from the BG. Furthermore, to capture the motion characteristics of FG
objects accurately, FG tracks are motion-corrected by subtracting the average BG motion.
These steps are illustrated in Fig. 2(Left). Although the FG/BG separation results are not
perfect, the portion of misclassified tracks is usually small and hence unlikely to
undermine the overall videography analysis.

Once segments are obtained, a set of videography features is extracted from every segment.
In this work, we focus on visual features related to motion and scale: (1) camera
motion type (S/P/T/Z), (2) FG motion, (3) BG motion, (4) correlation between FG and BG
motion, and (5) the scale of the foreground. For FG and BG motion, the average motion within
a segment is normalized w.r.t. the video width, to cope with video clips of varying sizes.
Our novel FG/BG correlation feature is motivated by the fact that similar camera motion may
be invoked by different intentions, e.g., tracking an object or simply switching focus. The
magnitude of FG/BG correlation is measured by the normalized sum of inner products between
FG tracks and the average BG motion. We also include the scale of FG objects as another
distinctive feature for videography. For example, clips with close-up shots of faces are
very different from clips which contain far-away shots of pedestrians. Because the
estimation of scale is a very challenging problem, in this work we use the bounding box
sizes of face detections produced by off-the-shelf systems (e.g., [EH]) as a proxy for scale
estimates. In detail, the average face size within a segment (normalized by the video
height) is used to represent the scale. For example, a face scale of 0.2 indicates that the
average face occupies about 20 percent of the image height. It is worth noting that there
are alternative approaches for scale estimation by solving depth [D3] or 3D geometry [I2D].
However, applying such methods to unconstrained videos is beyond the scope of this work,
and is left for future work.
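The per-segment features described above can be sketched in a few lines of Python. This is
a minimal illustration under stated assumptions, not the implementation used in this work:
each track is summarized by its average displacement (dx, dy) in pixels per frame, the
correlation uses the motion-corrected FG tracks with a cosine-style normalization, and a
scale of 0.0 stands for "no face detected". The function and key names are hypothetical.

```python
import math


def mean_vec(vectors):
    """Component-wise mean of a non-empty list of (dx, dy) vectors."""
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


def segment_features(fg_tracks, bg_tracks, face_heights, width, height):
    """Return BG/FG motion, FG/BG correlation magnitude, and face scale for a segment."""
    bg_mean = mean_vec(bg_tracks) if bg_tracks else (0.0, 0.0)
    # Motion-correct FG tracks by subtracting the average BG (camera-induced) motion.
    fg_corr = [(dx - bg_mean[0], dy - bg_mean[1]) for dx, dy in fg_tracks]
    fg_mean = mean_vec(fg_corr) if fg_corr else (0.0, 0.0)
    bg_norm = math.hypot(bg_mean[0], bg_mean[1])
    # Sum of inner products between FG tracks and average BG motion, normalized by
    # the product of vector lengths (the normalization is an assumption).
    inner = sum(dx * bg_mean[0] + dy * bg_mean[1] for dx, dy in fg_corr)
    denom = sum(math.hypot(dx, dy) for dx, dy in fg_corr) * bg_norm
    correlation = abs(inner) / denom if denom > 0 else 0.0
    # Average face height relative to the video height; 0.0 means no face.
    scale = sum(face_heights) / len(face_heights) / height if face_heights else 0.0
    return {
        "bg_motion": bg_norm / width,  # normalized w.r.t. video width
        "fg_motion": math.hypot(fg_mean[0], fg_mean[1]) / width,
        "correlation": correlation,
        "scale": scale,
    }
```

For instance, when a camera pans to follow an object, the object stays roughly fixed in the
image while the background moves; after correction, the FG motion is aligned with the BG
motion axis and the correlation magnitude in this sketch approaches 1.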

For our experiments, we extracted the above-mentioned videography features from a
training video dataset, which consists of roughly 2000 unconstrained videos (~80 hours
total), where 29 segments are found per clip on average. The overall distribution of the
extracted features is shown in Fig. 2(Right), where the multi-modal characteristics of most
videography features (except FG motion) can be observed. Such patterns indicate that there
are indeed regular videography patterns in videos.

4  Videography  Dictionary  and  Analysis 

Once videography features are obtained from segments, they are grouped to form a
videography dictionary (VD), shown in Fig. 1(b). The computed VD is used to quantize video
clips into sequences of videography words (VWs), as shown in Fig. 1(c).

We have explored two different methods for developing the dictionary: (1) concatenated
and (2) joint learning. In the first method, concatenated learning, each feature dimension
is quantized individually, and the quantized values are then concatenated to form the VD in
a combinatorial manner. The first feature dimension, camera motion type, directly takes the
four quantization values S/P/T/Z. We quantize the remaining features individually, based on
an empirical analysis of the training set. As illustrated in Fig. 2(Right), the BG and FG
motion are each quantized into small/medium/large; the FG/BG correlation into correlation
or no-correlation; and the scale into no-face/small/medium/large. The video words are then
formed by concatenating these values. This creates 4 x 3 x 3 x 2 x 4 = 288 possible video
words.
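The combinatorial construction can be made concrete with a short Python sketch. The level
names follow the text; the numeric thresholds are hypothetical placeholders (the text only
states that they were chosen empirically on the training set), as are the function names.

```python
from itertools import product

CAMERA = ("S", "P", "T", "Z")
MOTION = ("small", "medium", "large")
CORRELATION = ("no-correlation", "correlation")
SCALE = ("no-face", "small", "medium", "large")

# Every combination of (camera, BG motion, FG motion, correlation, scale) is one word.
VOCAB = list(product(CAMERA, MOTION, MOTION, CORRELATION, SCALE))
WORD_ID = {word: i for i, word in enumerate(VOCAB)}

# Hypothetical thresholds, for illustration only.
THRESHOLDS = {"bg": (0.005, 0.02), "fg": (0.005, 0.02), "corr": 0.5, "scale": (0.1, 0.3)}


def three_level(value, low, high):
    """Map a continuous value to small/medium/large using two cut points."""
    if value < low:
        return "small"
    return "medium" if value < high else "large"


def quantize(camera, bg, fg, corr, scale, th=THRESHOLDS):
    """Quantize each feature independently and concatenate the levels into a word."""
    bg_q = three_level(bg, *th["bg"])
    fg_q = three_level(fg, *th["fg"])
    corr_q = "correlation" if corr >= th["corr"] else "no-correlation"
    scale_q = "no-face" if scale <= 0 else three_level(scale, *th["scale"])
    return (camera, bg_q, fg_q, corr_q, scale_q)
```

Here `len(VOCAB)` equals 288, matching the count above, and `WORD_ID[quantize(...)]` gives
the integer index of a segment's word.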

Our analysis of the distribution of the resulting VD shows that, interestingly, only ~40%
of the words are actually observed in the data, indicating that only a subset of
combinations of feature quantizations is present; e.g., a combination such as zoom-in, large
FG and BG motion, no correlation, and large scale does not appear. Furthermore, if we
eliminate rare words which have fewer than ten occurrences, we are left with only 82 unique
videography words over a dataset of 80 hours of unconstrained video. This observation
suggests that there are fairly regular patterns in how people capture videos, regardless of
content. To the best of our knowledge, this is the first study that provides automated
analysis of the characteristics of videography styles in unconstrained Internet videos.

Figure 3: (a) Videography word examples. (b) Mutual information between different event
classes and the 50 most frequent VWs. (c) Qualitative analysis on 4 event classes (Board
Trick, Flash Mob, Wedding Ceremony, Birthday Party).


In the second method, joint learning, we again quantize the motion type into the same four
values (S/P/T/Z). However, for each motion type, we perform K-means clustering on the
remaining four-dimensional continuous vector space formed by concatenating the four raw
feature types (FG motion, BG motion, amount of correlation, size of face). In our
experiments, we chose K=30, which yields 4 x 30 = 120 video words. We used a smaller number
of clusters because of the observation that many of the video words from the first method
were actually not used.
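A minimal sketch of the joint learning step is given below, using a plain Lloyd-style
K-means written in standard Python. The initialization (random sampling with a fixed seed),
the iteration limit, and the word-index layout (motion type index times K plus cluster
index) are assumptions for illustration; features would also typically be rescaled before
clustering, which is omitted here.

```python
import random

MOTION_TYPES = ["S", "P", "T", "Z"]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on tuples; requires len(points) >= k."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid; an empty cluster keeps its previous centroid.
        updated = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
                   for j, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def learn_joint_dictionary(segments, k):
    """segments: list of (motion_type, (fg, bg, corr, scale)); one K-means per type."""
    return {m: kmeans([tuple(f) for mt, f in segments if mt == m], k)
            for m in MOTION_TYPES}


def assign_word(segment, dictionary, k):
    """Word index = motion type index * k + nearest cluster within that type."""
    motion, features = segment
    return MOTION_TYPES.index(motion) * k + nearest(features, dictionary[motion])
```

With K=30 this layout produces word indices in the range 0 to 119, i.e., 120 words.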

Once VDs are obtained, we can assess their quality as a macro feature type by examining
the sample video segments in each word cluster. Example segments belonging to two sample
videography word clusters are shown in Fig. 3(a), along with the detected visual features
overlaid on images to show more details, including camera motion (bottom-left arrows),
compensated FG motion (green tracks), and face detections (orange boxes)1. The textual
descriptions of both words were produced manually, by looking at both the feature vector
values and the grouped segments. It can be observed that segments with highly related
content are successfully grouped into the same VWs. In particular, it is worth noting that
in the second example, similar segments are grouped together correctly, even though faces
are not detected due to the challenging imaging conditions. We have manually examined
10 VWs by drawing 30 segment samples each and concluded that, on average, 88% of segments
from the same VW show perceptually identical videography.

1 Faces are intentionally occluded in this figure for privacy.

We also conducted analysis on the correlations between VWs and particular visual content,
so-called events. By events, we mean semantic content classes captured in videos, such
as Flash mob or Birthday party (defined further in [□]). This notion of analyzing or
learning about the videography of videos containing the same events is illustrated in
Fig. 3(b,c). Specifically, we measured the mutual information (MI) between each word and
each event. A high MI score indicates that a word is discriminative for the corresponding
event. Our results are summarized in Fig. 3(b), where the MI between every event and the
50 most frequent VWs is shown. It can be observed that, for a particular event, there are
certain signature VWs. A more detailed analysis is shown for four event types and the top
20 words in Fig. 3(c). In particular, this analysis provides insight into how different
events are captured with different styles. For example, it shows that the event Board trick
has a strong style of tracking a moving object; the event Flash mob has a strong style of
browsing scenes; the event Wedding ceremony shows frequent zooming; and the event Birthday
party shows frequent facial close-ups. This observation on discriminative correlations
suggests that videography analysis can be used for challenging tasks such as retrieval
(Sec. 5) and summarization (Sec. 6).
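For illustration, the MI between one word and one event can be computed from co-occurrence
counts as sketched below. The text does not specify how the two random variables are
defined; this sketch assumes two binary indicators per segment ("the segment is quantized to
this word" and "the segment comes from a clip of this event") and reports MI in bits.

```python
import math
from collections import Counter


def mutual_information(pairs, word, event):
    """pairs: list of (word, event) observations, one per segment (assumed format)."""
    n = len(pairs)
    # 2x2 contingency table over the two binary indicators.
    counts = Counter((w == word, e == event) for w, e in pairs)
    mi = 0.0
    for (is_word, is_event), c in counts.items():
        p_joint = c / n
        p_word = sum(v for (a, _), v in counts.items() if a == is_word) / n
        p_event = sum(v for (_, b), v in counts.items() if b == is_event) / n
        mi += p_joint * math.log2(p_joint / (p_word * p_event))
    return mi
```

A word that occurs only in segments of one event receives a high score, while a word spread
evenly over all events scores close to zero.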

5  Application  for  Video  Retrieval 

In this section, we present our approach and experimental results for videography-based
video retrieval. In detail, we compute a videography-word bag-of-words (VW-BoW)
representation for every clip, where per-clip unigram features are built from the sequence
of VWs regardless of temporal ordering. The goals are to examine (1) how well the proposed
VW-BoW feature performs in retrieval tasks by itself, compared to other alternatives and
with detailed studies on the contribution of each videography feature component, and
(2) whether our approach offers a useful modality to capture characteristics of videos
belonging to high-level event classes, in comparison to other macro-level features such as
GIST [□].
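A VW-BoW feature is simply a histogram of word indices over the segments of a clip. The
sketch below assumes integer word indices and L1 normalization; the normalization choice is
an assumption, since the text does not state one.

```python
from collections import Counter


def vw_bow(word_sequence, vocab_size, normalize=True):
    """Unigram histogram of videography words; temporal order is discarded."""
    counts = Counter(word_sequence)
    hist = [float(counts.get(i, 0)) for i in range(vocab_size)]
    total = sum(hist)
    if normalize and total > 0:
        hist = [h / total for h in hist]  # L1 normalization (assumed)
    return hist
```

Because only counts are kept, two clips with the same segments in a different order map to
the same feature vector.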

As our dataset, we use the TRECVID 2011 multimedia event detection (MED) corpus [ffl], due
to its large size, realistic content variability, and existing clip-level annotations for
15 different event classes. Both the scale and complexity of the dataset are beyond
the widely-used datasets [□, □]. Clips are frequently captured in unconstrained lighting and
camera motion conditions, exhibit diverse degrees of encoding artifacts and severe
background clutter, and are heavily edited by owners using shot stitching, caption
embedding, etc. For training data, we use “Part-1 training data” (called event kits), which
consists of videos from 15 different event classes with 137 clips per class on average
(2061 clips in total) and an average duration of 4.2 minutes. From these training data, our
VDs are computed by selecting the best run out of 100 K-means clustering runs, and are later
used for the test data. The 15 event types are listed in the caption of Fig. 4, with events
frequently exhibiting complex camera motion marked in bold face. For test data, the MED
corpus provides two different subsets, “Part-1 DEV-T” for the first 5 event classes and
“MED11TEST” for the remaining 10 event classes, with 4292 and 32061 total clips
respectively. Both test datasets contain a large number of negative clips which do not
belong to any of the target event classes; consequently, they serve as a realistic test-bed
for retrieval experiments. The positive examples in the two test datasets constitute only
2.34% and 0.37% on average per class, respectively.

Our retrieval experiments are conducted using one-vs-all SVM classifiers, the parameters of
which are tuned via cross-validation. The overall results are summarized in Fig. 4 and Table
1, where several experiments are conducted2. As the performance metric, average precision
(AP) is used. It is worth noting that APs for E06-E15 are lower than for E01-E05, because
the relative ratio of negative samples in the test dataset for E06-E15 is about 10 times
higher. In detail, Chance denotes random retrieval and PTZ denotes the use of
four-dimensional BoWs of discrete camera motion types only (e.g., S/P/T/Z) without detailed
videography features, as comparative methods [O, 123]. The variations of our approach are
marked using abbreviations, where J and C denote joint or concatenated VD learning,
described in Sec. 4. Additionally, B, F, C, S indicate the inclusion of BG motion, FG
motion, BG/FG correlation, and scale respectively, during VD learning. These experiments
have been conducted to examine the usefulness of each videography feature for retrieval.
The minus sign (-) indicates that the VD has been pruned by filtering out VWs with low MI
scores per event type. For all the experiments with BoW-type features, the histogram
intersection kernel (HIK) was used for SVM training and testing. In addition, GIST shows
the results using GIST features [O] with linear SVMs. Because GIST is a per-image feature,
GIST features are computed on frames extracted from labeled video clips. Then, one-vs-all
SVMs are trained on image features using clip labels. For testing, the SVMs are applied to
extracted images, and the scores are averaged to produce a clip-level score. VWs and GIST
evidently capture very distinct signals from the data. Accordingly, in the experiment
marked as Fusion, we further explore whether fusing the two modalities leads to further
improvement, which would show whether these two feature types are complementary. For
fusion, we use “late fusion” (e.g., [i]), where the weighted sum of the two classifier
scores is used as the fusion score. Among VW-based approaches, J_BFCS was used because it
provides the best performance, and weights were determined by cross-validation, where equal
weights of <0.5, 0.5> were found to be best.

2 Detailed numerical values of all experimental results can be found in the supplemental
materials.

Figure 4: Average precision (%) of video retrieval results on the MED corpus, for 15 events:
(E01) Board trick, (E02) Feeding animal, (E03) Fishing, (E04) Wedding, (E05) Working
wood project, (E06) Birthday party, (E07) Change vehicle tire, (E08) Flash mob, (E09)
Getting vehicle unstuck, (E10) Groom animal, (E11) Make sandwich, (E12) Parade, (E13)
Parkour, (E14) Repair appliance, and (E15) Sewing project. Compared methods: Chance, PTZ,
J_BF, J_BFC, J_BFS, J_BFCS, J_BFCS-, C_BFCS, GIST, and Fusion.

Table 1: Mean average precision (%) of video retrieval results on the MED corpus, for the
two separate test datasets covering events E01-E05 and E06-E15 respectively. Fusion results
are obtained by combining J_BFCS and GIST. The results with dynamic events only are marked
with (D), which include events: E01, E04, E06, E08, E12, and E13.

mAP       Chance   PTZ    J_BFCS   J_BFCS(D)   GIST   GIST(D)   Fusion   Fusion(D)
E01-E05   2.34     5.63   13.61    24.50       8.57   10.34     17.74    30.35
E06-E15   0.37     0.62   1.19     2.08        1.61   2.22      2.81     4.99
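The histogram intersection kernel and the late-fusion step described in the text can be
sketched as follows. The function names are hypothetical, and the sketch assumes that the
two classifiers' scores have already been brought to comparable ranges before being
combined; the text does not say how (or whether) scores were calibrated.

```python
def hik(x, y):
    """Histogram intersection kernel: sum of element-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(x, y))


def kernel_matrix(rows, cols):
    """Gram matrix K[i][j] = hik(rows[i], cols[j]), usable as a precomputed SVM kernel."""
    return [[hik(r, c) for c in cols] for r in rows]


def late_fusion(scores_a, scores_b, w_a=0.5, w_b=0.5):
    """Weighted sum of two per-clip score lists (e.g., VW-based and GIST-based)."""
    return [w_a * a + w_b * b for a, b in zip(scores_a, scores_b)]
```

A Gram matrix of this form can be passed to any SVM implementation that accepts a
precomputed kernel; the default weights mirror the equal weights reported in the text.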

Overall, we can observe that VWs clearly provide an advantage over the conventional simpler
alternative of using camera motion types only, i.e., PTZ. From Fig. 4, it can also be
observed that every videography feature contributes towards improving performance. Between
joint and concatenated VD learning, joint learning shows superior performance overall,
possibly due to the data-driven construction of the dictionary, which avoids the many empty
(or rare) VWs in concatenated learning. However, pruning VWs by MI scores does not seem to
necessarily boost performance. Table 1 shows the mean average precision (mAP) for key
experiments in Fig. 4 on the two test datasets. It can be observed that a motion-based
macro feature such as videography can outperform GIST for E01-E05 in the “Part-1 DEV-T”
set, and for E06, E08, E11, E13, and E14 in the “MED11TEST” set. More importantly, the
fusion results are much better than either approach alone, indicating that the two feature
types are complementary. Table 1 also shows mAPs for dynamic events only, where we observe
a big boost in performance for VWs. Interestingly, the event classes which show clear
discriminative correlation with VWs in Fig. 3(b) are dynamic events, and they also show
more advantage when VWs are used for retrieval.
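For reference, AP and mAP as used in the discussion above can be computed with the short
sketch below. It uses the common non-interpolated definition (mean of the precision values
at the rank of each positive clip); the exact variant used by the TRECVID evaluation tools
may differ, so this is an illustration rather than a reproduction of the reported numbers.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: rank clips by score, average precision at each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_event_ap):
    """mAP: arithmetic mean of per-event AP values."""
    return sum(per_event_ap) / len(per_event_ap)
```

Under this definition, a random ranking yields an AP close to the fraction of positive
clips, which is consistent with the Chance column of Table 1 matching the positive rates of
2.34% and 0.37%.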


6  Application  for  Video  Summarization 


Figure 5: Videography-aware adaptive summarization. (Left) Segment scores are based
on MIs of corresponding VWs (synthetic data; example styles: tracking object with FG/BG
correlation, zoom in on medium face, browse scene). Frames are selected at designated
relative locations within segments (two real-data examples; selected key frames shown as
red bars). (Right) Three summarization results by this work (red rows) and the baseline
(blue rows). Detected FG regions (green) and human judgements on the relevance of key
frames (good: none, near-miss: yellow, miss: red) to associated events are marked on each
image.


In this section, we present our videography-aware adaptive summarization method, which
is designed to highlight the segments with distinctive videography styles for particular events.
Our insight is that identifying the segments in which cameramen systematically exhibit
videography styles distinctive of particular events yields a unique summarization,
assuming that such segments are strongly correlated with the major region
of interest. While many works deliberately avoid segments with motion due to their
complexity, e.g., [HD], such segments can be crucial for characterizing dynamic content
in videos with frequent camera motion, which are often recorded by mobile devices.

In our approach, frames are extracted by a two-step procedure, as illustrated in Fig.
5 (Left). First, key segments are selected based on segment scores, with an optional weighted
sampling scheme in case there are more segments than the desired number of key
frames. MI scores are used as segment scores (see footnote 3). Then, key frames are extracted,
one per selected segment. In particular, our innovation is that frames are extracted from
different relative locations within each segment, depending on its videography.
Two key frame selection mechanisms are used: frames are selected (1)
in the middle of segments when the videography is either stationary or indicates FG/BG
correlation (to capture the peak of FG motion), and (2) at either end of segments when the
videography indicates P/T/Z without FG/BG correlation (to capture the destination of shifting
attention).
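
The two-step procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Segment` fields, the score-proportional sampling without replacement, and the choice of the final frame for rule (2) (the text says only "either end") are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    start: int           # first frame index (inclusive)
    end: int             # last frame index (inclusive)
    score: float         # segment score, e.g. the MI of its videography word
    camera_motion: bool  # videography word indicates pan/tilt/zoom (P/T/Z)
    fg_bg_corr: bool     # videography word indicates FG/BG correlation


def select_segments(segments, k, rng=None):
    """Step 1: pick at most k key segments.

    If there are more segments than k, sample without replacement with
    probability proportional to score (an assumed weighting scheme).
    """
    if len(segments) <= k:
        return list(segments)
    rng = rng or random.Random(0)
    pool = list(segments)
    chosen = []
    for _ in range(k):
        weights = [max(s.score, 0.0) + 1e-9 for s in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
    return sorted(chosen, key=lambda s: s.start)


def key_frame(seg):
    """Step 2: choose one frame per segment from its videography."""
    if seg.camera_motion and not seg.fg_bg_corr:
        # Rule (2): P/T/Z without FG/BG correlation. The final frame is
        # used here as the destination of attention (assumption).
        return seg.end
    # Rule (1): stationary or FG/BG-correlated videography -> middle frame.
    return (seg.start + seg.end) // 2


def summarize(segments, k):
    return [key_frame(s) for s in select_segments(segments, k)]
```

For example, a stationary segment spanning frames 0 to 100 yields frame 50, while a zoom segment without FG/BG correlation spanning frames 101 to 200 yields frame 200.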

3 Without event labels, term frequency-inverse document frequency (tf-idf) scores [S] can be used instead.

Qualitative summarization results are shown in Fig. 5 (Right), where frames extracted
from the same videos by our proposed method (red rows) and a conventional baseline (blue
rows) are compared for three different event classes. The results of the baseline method were
obtained by extracting the frames with the highest scores based on color histogram changes, which
is a very common approach [0]. It can be observed that our method is effective at identifying
unique content in clips. In particular, most extracted frames contain important visual moments
when the FG people are at the peak of their action or of the camera's focus, such as skilled jumps
or the moment before blowing out a birthday cake candle. In contrast, the baseline results tend to
include frames that merely exhibit a strongly changing background, or even black frames around
captions inserted by users. Overall, we observe that the proposed method can generate good
visual summaries, especially for clips that contain complex camera motion.
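
To make the comparison concrete, a color-histogram baseline of the kind described above might look like the sketch below. The details are assumptions, since the text specifies only that frames with the highest histogram-change scores are extracted: here the change is the L1 distance between the normalized histograms of consecutive frames, and the function names are illustrative.

```python
def histogram_change(h_prev, h_curr):
    """L1 distance between two normalized color histograms (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(h_prev, h_curr))


def baseline_key_frames(histograms, k):
    """Return the indices of the k frames whose color histogram differs
    most from that of the preceding frame, in temporal order."""
    scored = [
        (histogram_change(histograms[i - 1], histograms[i]), i)
        for i in range(1, len(histograms))
    ]
    # Highest change first; ties are broken by the earlier frame index.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:k]
    return sorted(i for _, i in top)


# Toy 3-bin histograms, not data from the paper.
toy_hists = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0.5, 0.5]]
toy_frames = baseline_key_frames(toy_hists, 2)
```

Because the score depends only on how much the color distribution changes, abrupt transitions such as a cut to a black caption frame rank highly regardless of content, which is consistent with the failure cases noted above.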

7  Conclusion 

We have presented our framework for videography learning and analysis, and its application
to video summarization and retrieval. The introduced features and data-driven VD
learning help to identify characteristic videography among videos from the same events. Our
experiments show that meaningful summarization and retrieval results can be obtained using
videography. Fusion results indicate that videography features capture unique aspects of
videos and can be used jointly with other features to improve retrieval substantially.